Notes: reviewing previous homeworks to identify candidate forecasting approaches, e.g. Box-Cox transformations and decomposition.
# Import required R libraries
library(fpp3)
#library(tidyverse)
library(readxl)
This project consists of three parts (two required and one bonus) and is worth 15% of your grade. The project is due at 11:59 PM on Sunday October 31. I will accept late submissions with a penalty until the meetup after that, when we review some projects.
In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable Cash is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose, to give this a little more of a business feeling. Explain and demonstrate your process, the techniques used and not used, and your actual forecast. I am giving you the data via an Excel file; please provide your written report on your findings, visuals, discussion, and your R code via an RPubs link, along with the actual .rmd file. Also please submit the forecast in an Excel-readable file.
# Read in data
atm_data_raw <- read_excel("data/ATM624Data.xlsx")
# Initial output to see data
head(atm_data_raw)
## # A tibble: 6 × 3
## DATE ATM Cash
## <dbl> <chr> <dbl>
## 1 39934 ATM1 96
## 2 39934 ATM2 107
## 3 39935 ATM1 82
## 4 39935 ATM2 89
## 5 39936 ATM1 85
## 6 39936 ATM2 90
summary(atm_data_raw)
## DATE ATM Cash
## Min. :39934 Length:1474 Min. : 0.0
## 1st Qu.:40026 Class :character 1st Qu.: 0.5
## Median :40118 Mode :character Median : 73.0
## Mean :40118 Mean : 155.6
## 3rd Qu.:40210 3rd Qu.: 114.0
## Max. :40312 Max. :10919.8
## NA's :19
dim(atm_data_raw)
## [1] 1474 3
# 1474 3
#Cash in hundreds
# Define as tsibble
atm_data_ts <- atm_data_raw %>%
as_tsibble(index = DATE, key = ATM)
library(seasonal)
atm1_data_ts <- atm_data_ts %>%
filter(ATM == 'ATM1')
summary(atm1_data_ts)
## DATE ATM Cash
## Min. :39934 Length:365 Min. : 1.00
## 1st Qu.:40025 Class :character 1st Qu.: 73.00
## Median :40116 Mode :character Median : 91.00
## Mean :40116 Mean : 83.89
## 3rd Qu.:40207 3rd Qu.:108.00
## Max. :40298 Max. :180.00
## NA's :3
atm1_data_ts
## # A tsibble: 365 x 3 [1]
## # Key: ATM [1]
## DATE ATM Cash
## <dbl> <chr> <dbl>
## 1 39934 ATM1 96
## 2 39935 ATM1 82
## 3 39936 ATM1 85
## 4 39937 ATM1 90
## 5 39938 ATM1 99
## 6 39939 ATM1 88
## 7 39940 ATM1 8
## 8 39941 ATM1 104
## 9 39942 ATM1 87
## 10 39943 ATM1 93
## # … with 355 more rows
atm1_data_ts %>%
autoplot(Cash)
# Calculate the median Cash value for ATM1 (named to avoid masking
# the median() function)
atm1_median <- median(atm1_data_ts$Cash, na.rm = TRUE)
# Impute NAs with the median
atm1_data_ts$Cash[is.na(atm1_data_ts$Cash)] <- atm1_median
summary(atm1_data_ts)
## DATE ATM Cash
## Min. :39934 Length:365 Min. : 1.00
## 1st Qu.:40025 Class :character 1st Qu.: 73.00
## Median :40116 Mode :character Median : 91.00
## Mean :40116 Mean : 83.95
## 3rd Qu.:40207 3rd Qu.:108.00
## Max. :40298 Max. :180.00
head(atm1_data_ts)
## # A tsibble: 6 x 3 [1]
## # Key: ATM [1]
## DATE ATM Cash
## <dbl> <chr> <dbl>
## 1 39934 ATM1 96
## 2 39935 ATM1 82
## 3 39936 ATM1 85
## 4 39937 ATM1 90
## 5 39938 ATM1 99
## 6 39939 ATM1 88
atm1_data_ts <- atm1_data_ts %>%
mutate(DATE = as_date(DATE)) %>%
update_tsibble(index = DATE)
atm1_data_ts
## # A tsibble: 365 x 3 [1D]
## # Key: ATM [1]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2079-05-03 ATM1 96
## 2 2079-05-04 ATM1 82
## 3 2079-05-05 ATM1 85
## 4 2079-05-06 ATM1 90
## 5 2079-05-07 ATM1 99
## 6 2079-05-08 ATM1 88
## 7 2079-05-09 ATM1 8
## 8 2079-05-10 ATM1 104
## 9 2079-05-11 ATM1 87
## 10 2079-05-12 ATM1 93
## # … with 355 more rows
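The converted dates land in 2079 because `as_date()` applied its default Unix origin (1970-01-01) to what are almost certainly Excel serial day numbers. The shift is uniform, so day-of-week structure and spacing are unaffected, but for reference, a minimal sketch using the Excel origin (`1899-12-30`, assuming the file uses the standard Windows Excel date system):

```r
# Excel serial day numbers count from 1899-12-30 in the standard
# Windows date system; applying that origin recovers the real dates.
start_date <- as.Date(39934, origin = "1899-12-30")  # first DATE value
end_date   <- as.Date(40298, origin = "1899-12-30")  # last ATM1 DATE value
start_date  # 2009-05-01
end_date    # 2010-04-30
```

This puts the ATM1 series at May 2009 through April 2010, consistent with the assignment's request to forecast May 2010.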
# From: https://stats.stackexchange.com/questions/494013/control-the-period-for-daily-time-series-in-tsibbles
atm1_dcmp <- atm1_data_ts %>%
model(stl = STL(Cash ~ trend(window=Inf) + season(period=7, window="periodic")))
atm1_dcmp %>% components() %>% autoplot()
components(atm1_dcmp) %>%
as_tsibble() %>%
autoplot(Cash, colour = "gray") +
geom_line(aes(y=season_adjust), colour = "#0072B2") +
labs(y = "Cash (in hundreds)", title = "Seasonally Adjusted Trendline")
# from textbook: If the seasonal component is removed from the original data, the resulting values are the “seasonally adjusted” data.
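To make that definition concrete, a toy sketch (made-up numbers, additive decomposition) of how the seasonally adjusted series is formed:

```r
# With an additive decomposition, season_adjust = y - seasonal component.
y        <- c(10, 2, 10, 2, 10, 2)   # toy series with period-2 seasonality
seasonal <- rep(c(4, -4), 3)         # toy seasonal component
season_adjust <- y - seasonal
season_adjust  # 6 6 6 6 6 6: the seasonal swing is removed
```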
# Create a training set (365 days total, roughly one year of data):
# the first 294 days for training, leaving the final 71 days as a test
# set (dates as generated from the integer DATE values)
train_atm1 <- atm1_data_ts %>%
filter_index("2079-05-03" ~ "2080-02-20")
train_atm1
## # A tsibble: 294 x 3 [1D]
## # Key: ATM [1]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2079-05-03 ATM1 96
## 2 2079-05-04 ATM1 82
## 3 2079-05-05 ATM1 85
## 4 2079-05-06 ATM1 90
## 5 2079-05-07 ATM1 99
## 6 2079-05-08 ATM1 88
## 7 2079-05-09 ATM1 8
## 8 2079-05-10 ATM1 104
## 9 2079-05-11 ATM1 87
## 10 2079-05-12 ATM1 93
## # … with 284 more rows
# Fit the models
fit_atm1 <- train_atm1 %>%
model(
Naive = NAIVE(Cash),
`Seasonal naive` = SNAIVE(Cash),
`Random walk` = RW(Cash ~ drift())
)
fit_atm1_snaive <- train_atm1 %>%
model(SNAIVE(Cash))
fit_atm1 %>% accuracy()
## # A tibble: 3 × 11
## ATM .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM1 Naive Trai… 9.90e- 2 49.9 37.5 -84.8 120. 2.05 1.74 -0.360
## 2 ATM1 Seasonal naive Trai… 5.40e- 1 28.6 18.3 -82.8 103. 1 1 0.145
## 3 ATM1 Random walk Trai… 9.64e-16 49.9 37.5 -85.1 120. 2.05 1.74 -0.360
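A note on reading this table: MASE scales each model's MAE by the in-sample MAE of the seasonal naive forecast, which is why SNAIVE's own training MASE is exactly 1. A toy calculation with made-up numbers:

```r
# MASE = MAE of the forecast / in-sample MAE of the (seasonal) naive
actual    <- c(12, 15, 11, 14)
predicted <- c(10, 16, 12, 13)
q    <- mean(abs(c(2, 1, 3, 2)))            # toy in-sample snaive errors
mase <- mean(abs(actual - predicted)) / q   # < 1 beats the benchmark
mase  # 0.625
```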
# Check the residuals of Seasonal Naive
fit_atm1_snaive %>% gg_tsresiduals()
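Beyond the residual plots, a portmanteau test can quantify leftover autocorrelation. A self-contained sketch using base R's `Box.test()` on simulated white noise as a stand-in for the model residuals (lag 14 spans two weekly cycles; with real residuals, a small p-value would suggest the model has not captured all the structure):

```r
set.seed(1)
resid_standin <- rnorm(100)   # stand-in for model residuals
lb <- Box.test(resid_standin, lag = 14, type = "Ljung-Box")
lb$p.value   # for white noise, this is typically large
```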
# Generate forecasts for the 71-day test period
fc_atm1 <- fit_atm1 %>% forecast(h = "71 days")
fc_atm1 %>% accuracy(atm1_data_ts)
## # A tibble: 3 × 11
## .model ATM .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Naive ATM1 Test -44.8 56.7 45.7 -602. 602. 2.49 1.98 -0.0619
## 2 Random walk ATM1 Test -48.4 59.7 49.2 -619. 620. 2.68 2.08 -0.0425
## 3 Seasonal naive ATM1 Test -20.3 60.9 49.7 -483. 511. 2.71 2.13 0.281
# Plot forecasts against actual values
fc_atm1 %>%
autoplot(train_atm1, level = NULL) +
autolayer(
filter_index(atm1_data_ts, "2080-02-21" ~ "2080-05-01"),
colour = "black"
) +
labs(
y = "Cash (in hundreds)",
title = "Forecasts for ATM1 Withdrawals"
) +
guides(colour = guide_legend(title = "Forecast"))
# Result: none of the benchmarks perform well; SNAIVE at least captures the weekly seasonal pattern
# ETS
fit_atm1_ets <- train_atm1 %>%
model(
add = ETS(Cash ~ error("A") + trend("N") + season("A")),
mult = ETS(Cash ~ error("M") + trend("N") + season("M")),
damp = ETS(Cash ~ error("M") + trend("Ad") + season("M"))
)
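All of these ETS variants are built on exponentially weighted updates of a level (and here a seasonal) state. As a reminder of the core mechanic, a minimal sketch of the simple-exponential-smoothing level recursion on toy data (not the fitted models above):

```r
# Level recursion: l_t = alpha * y_t + (1 - alpha) * l_{t-1};
# the final level is the one-step-ahead point forecast.
ses_level <- function(y, alpha) {
  level <- y[1]
  for (t in 2:length(y)) {
    level <- alpha * y[t] + (1 - alpha) * level
  }
  level
}
ses_level(c(100, 80, 90), alpha = 0.5)  # 90
```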
# Forecast over the 71-day test period
fc_atm1_ets <- fit_atm1_ets %>% forecast(h = "71 days")
fc_atm1_ets %>%
autoplot(filter_index(train_atm1, "2079-12-01" ~ "2080-05-01"), level = NULL) +
autolayer(
filter_index(atm1_data_ts, "2079-12-01" ~ "2080-05-01"),
colour = "black"
) +
labs(y="Cash (in hundreds)", title="ETS Forecasts: ATM1 Withdrawals") +
guides(colour = guide_legend(title = "Forecast"))
# Remaining plan for each ATM: forecast over the test set, measure the
# accuracy, then forecast beyond the data, comparing these models:
# NAIVE, SNAIVE, drift, ETS, ARIMA
atm2_data_ts <- atm_data_ts %>%
filter(ATM == 'ATM2')
atm2_data_ts %>% autoplot(Cash)
# Calculate the median Cash value for ATM2
atm2_median <- median(atm2_data_ts$Cash, na.rm = TRUE)
# Impute NAs with the median
atm2_data_ts$Cash[is.na(atm2_data_ts$Cash)] <- atm2_median
summary(atm2_data_ts)
## DATE ATM Cash
## Min. :39934 Length:365 Min. : 0.0
## 1st Qu.:40025 Class :character 1st Qu.: 26.0
## Median :40116 Mode :character Median : 67.0
## Mean :40116 Mean : 62.6
## 3rd Qu.:40207 3rd Qu.: 93.0
## Max. :40298 Max. :147.0
atm2_data_ts %>%
model(stl = STL(Cash ~ trend(window=Inf) + season(period=7, window="periodic"))
) %>%
components() %>% autoplot()
atm3_data_ts <- atm_data_ts %>%
filter(ATM == 'ATM3', Cash > 0)
atm3_data_ts %>% autoplot(Cash)
summary(atm3_data_ts)
## DATE ATM Cash
## Min. :40296 Length:3 Min. :82.00
## 1st Qu.:40296 Class :character 1st Qu.:83.50
## Median :40297 Mode :character Median :85.00
## Mean :40297 Mean :87.67
## 3rd Qu.:40298 3rd Qu.:90.50
## Max. :40298 Max. :96.00
# ATM3 has only 3 nonzero Cash values, too few to model on their own
atm4_data_ts <- atm_data_ts %>%
filter(ATM == 'ATM4')
atm4_data_ts %>% autoplot(Cash)
summary(atm4_data_ts)
## DATE ATM Cash
## Min. :39934 Length:365 Min. : 1.563
## 1st Qu.:40025 Class :character 1st Qu.: 124.334
## Median :40116 Mode :character Median : 403.839
## Mean :40116 Mean : 474.043
## 3rd Qu.:40207 3rd Qu.: 704.507
## Max. :40298 Max. :10919.762
atm4_data_ts %>%
model(stl = STL(Cash ~ trend(window=Inf) + season(period=7, window="periodic"))
) %>%
components() %>% autoplot()
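Before modeling ATM4, the single extreme value (10919.762 against a median near 404) probably needs treatment. One judgment call (not prescribed by the assignment) is to replace anything implausibly far above the median with the median itself, sketched here on toy numbers with a hypothetical 10x-median threshold:

```r
# Replace values more than 10x the median with the median
# (the threshold is a judgment call, not a standard rule).
x <- c(120, 400, 700, 10919.762)
x[x > 10 * median(x)] <- median(x)
x  # the 10919.762 becomes 550, the median of the original values
```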
Observations from the summary and dimension output (1474 rows, 3 columns):

- DATE runs from 39934 to 40312 (379 distinct dates); Cash has 19 NA's.
- ATM takes the values ATM1, ATM2, ATM3, ATM4, and NA.
- ATM1 has 3 Cash NA values; ATM2 has 2 Cash NA values.
- ATM3 has only 3 Cash values above zero.
- ATM4 has many Cash values with decimals (but not all), which is odd, and one extreme outlier.
- The final 14 entries (DATE of 40299 and higher) are NA for both ATM and Cash.
Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.
# Read in data
power_data_raw <- read_excel("data/ResidentialCustomerForecastLoad-624.xlsx")
# Change column name to 'Month'
names(power_data_raw)[names(power_data_raw) == 'YYYY-MMM'] <- 'Month'
head(power_data_raw)
## # A tibble: 6 × 3
## CaseSequence Month KWH
## <dbl> <chr> <dbl>
## 1 733 1998-Jan 6862583
## 2 734 1998-Feb 5838198
## 3 735 1998-Mar 5420658
## 4 736 1998-Apr 5010364
## 5 737 1998-May 4665377
## 6 738 1998-Jun 6467147
summary(power_data_raw)
## CaseSequence Month KWH
## Min. :733.0 Length:192 Min. : 770523
## 1st Qu.:780.8 Class :character 1st Qu.: 5429912
## Median :828.5 Mode :character Median : 6283324
## Mean :828.5 Mean : 6502475
## 3rd Qu.:876.2 3rd Qu.: 7620524
## Max. :924.0 Max. :10655730
## NA's :1
dim(power_data_raw)
## [1] 192 3
# 192 3
# KWH
power_data_ts <- power_data_raw %>%
mutate(Month = yearmonth(Month)) %>%
mutate(KWH = KWH/1e3) %>% # In thousands
as_tsibble(index = Month)
head(power_data_ts)
## # A tsibble: 6 x 3 [1M]
## CaseSequence Month KWH
## <dbl> <mth> <dbl>
## 1 733 1998 Jan 6863.
## 2 734 1998 Feb 5838.
## 3 735 1998 Mar 5421.
## 4 736 1998 Apr 5010.
## 5 737 1998 May 4665.
## 6 738 1998 Jun 6467.
power_data_ts %>%
autoplot(KWH)
power_data_ts %>%
filter_index("2010" ~ "2011") %>%
print()
## # A tsibble: 13 x 3 [1M]
## CaseSequence Month KWH
## <dbl> <mth> <dbl>
## 1 877 2010 Jan 9397.
## 2 878 2010 Feb 8391.
## 3 879 2010 Mar 7348.
## 4 880 2010 Apr 5776.
## 5 881 2010 May 4919.
## 6 882 2010 Jun 6696.
## 7 883 2010 Jul 771.
## 8 884 2010 Aug 7923.
## 9 885 2010 Sep 7819.
## 10 886 2010 Oct 5876.
## 11 887 2010 Nov 4801.
## 12 888 2010 Dec 6153.
## 13 889 2011 Jan 8395.
First observations: the data are monthly, so I'd expect a seasonal component. One value is missing: KWH has 1 NA (2008 Sep), which I am considering imputing with the median September value. There is also an outlier, a very small value (771) in July 2010, which I am considering imputing with the median July value.
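A minimal sketch of the month-specific median imputation being considered, on toy numbers (the real version would filter power_data_ts to the relevant calendar month):

```r
# Impute a missing September reading with the median of the other
# September values (hypothetical numbers, in thousands of KWH).
sep_kwh <- c(7000, NA, 7200, 6900)
sep_kwh[is.na(sep_kwh)] <- median(sep_kwh, na.rm = TRUE)
sep_kwh  # the NA becomes 7000
```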
Part C consists of two data sets. These are simple two-column sets; however, they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (an example of what this looks like follows). Note: for multiple recordings within an hour, take the mean. Then determine whether the data are stationary and whether they can be forecast. If so, provide a week-forward forecast and present the results via RPubs and the .rmd file, with the forecast in an Excel-readable file.
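A base-R sketch of the hour-based aggregation Part C describes, on hypothetical timestamps and values (multiple readings within an hour are averaged):

```r
# Truncate each timestamp to its hour, then take the mean per hour.
ts <- as.POSIXct(c("2020-01-01 10:15:00", "2020-01-01 10:45:00",
                   "2020-01-01 11:05:00"), tz = "UTC")
vals <- c(2, 4, 6)
hour <- format(ts, "%Y-%m-%d %H:00")   # hour-bucket labels
hourly_mean <- tapply(vals, hour, mean)
hourly_mean  # 10:00 hour -> 3, 11:00 hour -> 6
```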